Modular content parser: YouTube + Instagram + Reddit #1
Draft
codeby wants to merge 33 commits into
CLI tool that resolves search queries, channels, playlists, or video URLs to a list of videos, then fetches top-level comments (optionally with replies) via the YouTube Data API and transcripts via youtube-transcript-api. Writes per-video JSON + Markdown plus a summary CSV and index. https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
Browser-based form wraps the existing parser modules: queries, channels, playlists, and videos as separate tabs; sidebar holds API key and limits; runs stream live status into the page; results are downloadable as a single ZIP or as summary.csv. https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
Falls back to the YOUTUBE_API_KEY environment variable for local runs. Wraps st.secrets access in try/except so a missing secrets.toml does not crash the app locally. https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
The Streamlit app UI is now in Russian end-to-end. Added Save / Delete buttons next to the API key field that write the key to ~/.youtube_parser_config.json (chmod 600). Loading order on startup: st.secrets → $YOUTUBE_API_KEY → saved file. .gitignore added to keep caches, virtualenvs, the secrets file, and parser output out of git. https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
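A minimal sketch of that fallback chain, assuming a helper named load_api_key and an "api_key" field in the saved config file (both names illustrative):

    import json, os
    from pathlib import Path
    import streamlit as st

    CONFIG_PATH = Path.home() / ".youtube_parser_config.json"

    def load_api_key() -> str | None:
        try:
            key = st.secrets.get("YOUTUBE_API_KEY")   # raises when secrets.toml is missing
            if key:
                return key
        except Exception:
            pass
        key = os.environ.get("YOUTUBE_API_KEY")        # local runs
        if key:
            return key
        if CONFIG_PATH.exists():                       # key saved via the Save button (chmod 600)
            return json.loads(CONFIG_PATH.read_text()).get("api_key")
        return None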
The Save button now writes the key to both ~/.youtube_parser_config.json and .streamlit/secrets.toml so it is available globally and via st.secrets in the same Streamlit project. The TOML upsert preserves any other keys in the file and deletes the file if removing the key leaves it empty. Delete clears both locations. The status caption lists every place the key is saved. https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
The 1.x release replaced the static YouTubeTranscriptApi.list_transcripts class method with an instance API (api.list / api.fetch). The old code silently failed for every video because the broad except returned None on the AttributeError, so the UI always reported "no transcript". Rewrote transcripts.py against the new API and switched to a verbose return shape so callers can distinguish disabled, missing, and blocked cases. Both the Streamlit app and the CLI now report the actual reason when no transcript is produced. Pinned the dependency to >=1.0.0. https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
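A sketch of the 1.x-style call the rewrite targets; the status strings below are illustrative rather than the module's exact return shape, and the exception names are the library's as of 1.x:

    from youtube_transcript_api import (
        YouTubeTranscriptApi,
        TranscriptsDisabled,
        NoTranscriptFound,
    )

    def fetch_transcript(video_id: str) -> dict:
        api = YouTubeTranscriptApi()                 # 1.x: instance, not class-level methods
        try:
            fetched = api.fetch(video_id)
            return {"status": "ok", "segments": fetched.to_raw_data()}
        except TranscriptsDisabled:
            return {"status": "disabled", "segments": None}
        except NoTranscriptFound:
            return {"status": "missing", "segments": None}
        except Exception as exc:                     # e.g. RequestBlocked (see the next commit)
            return {"status": "blocked", "error": str(exc), "segments": None}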
YouTube blocks transcript requests from datacenter IPs (Streamlit Cloud, GCP, AWS), surfacing as RequestBlocked. Add a proxy_config kwarg to the transcripts module and a sidebar section in the Streamlit app to choose Webshare (rotating residential proxies) or a generic HTTP proxy. Defaults are pulled from st.secrets (WEBSHARE_USERNAME, WEBSHARE_PASSWORD, PROXY_HTTP_URL, PROXY_HTTPS_URL) or environment variables, so creds set in the Streamlit Cloud dashboard load automatically. https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
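Roughly how the proxy_config kwarg is wired; the proxy classes below are youtube-transcript-api's 1.x proxy helpers as I understand them, so treat the names and parameters as assumptions to verify against the installed version:

    from youtube_transcript_api import YouTubeTranscriptApi
    from youtube_transcript_api.proxies import WebshareProxyConfig, GenericProxyConfig

    # Webshare rotating residential proxies (credentials from st.secrets or env)
    proxy = WebshareProxyConfig(proxy_username="user", proxy_password="pass")
    # ...or a generic HTTP proxy:
    # proxy = GenericProxyConfig(http_url="http://user:pass@host:port",
    #                            https_url="http://user:pass@host:port")

    api = YouTubeTranscriptApi(proxy_config=proxy)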
Adds a source-agnostic core (schema.py with Item/Comment/Transcript dataclasses, plugin.py with the SourcePlugin ABC plus InputSpec/FieldSpec, registry.py, runner.py, secrets.py, output.py, errors.py) so additional sources can plug in alongside YouTube without touching the core. The existing YouTube modules move into content_parser/plugins/youtube/ with an adapter that converts API dicts into the new Item schema and a YouTubePlugin implementing the contract. The youtube_parser/sources.py, comments.py, and transcripts.py become one-line shims that re-export from the new location, so existing callers (app.py, youtube_parser.main) keep working unchanged. https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
content_parser.cli exposes 'list-sources' and 'run --source ... --input KIND=VALUE --set KEY=VALUE'. Convenience aliases (--query, --channel, --video, --hashtag, --account, --post) and key=value setting overrides make scripted runs ergonomic. content_parser/ui/app.py renders the Streamlit interface from each plugin's input_specs() and settings_specs(), so adding a new source needs no UI changes. Sidebar manages secrets per plugin (load from st.secrets/env/config.json, save/clear buttons), the proxy block shows only when the active plugin has a proxy_provider setting. Root app.py is now a 3-line shim into content_parser.ui.app.main, so Streamlit Cloud picks up the new UI on next deploy. The legacy youtube_parser.main CLI keeps working unchanged via the back-compat shims introduced in the previous commit. https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
InstagramPlugin handles three input kinds — hashtags, accounts, and direct post/reel URLs — and runs them in a single Apify actor call. The adapter maps Apify post fields (likesCount, videoViewCount, musicInfo, latestComments with nested replies) into the unified Item schema, with audio_id and audio_title surfaced under media for trend research. ApifyClient is a thin wrapper around run-sync-get-dataset-items with explicit handling of 401 (bad token) and 402 (out of credits). The plugin auto-registers via content_parser.core.registry, so the CLI and Streamlit UI pick it up without further changes — confirmed via 'python -m content_parser.cli list-sources'. Adds requests>=2.31.0 to requirements. https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
- registry.py now distinguishes ImportError (optional dep missing — silent at DEBUG) from any other exception (typo, runtime bug — printed to stderr), so plugins no longer disappear without explanation.
- runner.py wraps the fetch loop in try/finally; summary.csv and index.md are flushed even when fetch raises mid-iteration, so partial runs stay inspectable. The original exception is re-raised afterwards.
- secrets.py escapes backslashes and double quotes when writing values to .streamlit/secrets.toml, so a value containing a quote no longer produces a malformed TOML file that breaks st.secrets on next start.
Verified with a mini test harness: TOML round-trips a value like 'a"b\\c' through tomllib, the runner produces summary.csv after a forced mid-loop crash, and the registry warns on a NameError while staying silent on a missing optional import.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
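The escaping fix in miniature (helper name hypothetical); this is the round-trip the test harness checks:

    import tomllib

    def _toml_escape(value: str) -> str:
        # Backslashes first, then double quotes (the order matters)
        return value.replace("\\", "\\\\").replace('"', '\\"')

    raw = 'a"b\\c'
    line = 'YOUTUBE_API_KEY = "' + _toml_escape(raw) + '"'
    assert tomllib.loads(line)["YOUTUBE_API_KEY"] == raw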
…der auth
- Routing per input kind: hashtags + accounts go to one Apify call with
the user-chosen resultsType ('posts' by default). Explicit post/reel
URLs go to a second call with resultsType='details', since 'posts' on
a single-post URL returns nothing useful. The runner sees this as one
fetch generator yielding all results combined.
- _normalize_account refuses URLs whose first path segment is /p/, /reel/,
/explore/, etc. — those used to silently turn into a request for a
username like 'p', returning empty data with no clear error. Also
validates username characters against Instagram's allowed set.
- resolve() raises a PluginError if a value in the post_url field doesn't
look like /p/ or /reel/, so users catch the mistake before paying for
a useless Apify run.
- ApifyClient sends the token in the Authorization: Bearer header instead
of as a ?token= query string, so it doesn't leak into nginx access logs.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
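The account normalization guard, roughly; the real code raises PluginError rather than ValueError, and the reserved-path set plus the 30-character username alphabet are Instagram's usual rules, assumed here:

    import re
    from urllib.parse import urlparse

    _USERNAME_RE = re.compile(r"^[A-Za-z0-9._]{1,30}$")
    _NON_ACCOUNT_PATHS = {"p", "reel", "reels", "explore", "stories", "tv"}

    def _normalize_account(value: str) -> str:
        value = value.strip().lstrip("@")
        if value.startswith(("http://", "https://")):
            parts = [p for p in urlparse(value).path.split("/") if p]
            if not parts or parts[0].lower() in _NON_ACCOUNT_PATHS:
                raise ValueError(f"not an account URL: {value!r}")
            value = parts[0]
        if not _USERNAME_RE.match(value):
            raise ValueError(f"invalid Instagram username: {value!r}")
        return value.lower()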
- youtube_parser/main.py is now a translation layer over content_parser.cli:
it parses the original argument set ('--query', '--video', '--max-comments',
'--include-replies', '--no-transcripts', etc.) and rewrites it into the new
'--source youtube --set key=value' form. Removes ~150 lines of duplicated
CLI logic that drifted away from the new output layout.
- ui/app.py _render_field now handles a 'select' widget with no options
and no default by falling back to a free-text input, so a misconfigured
FieldSpec doesn't crash the whole UI.
- .gitignore picks up .content_parser/ (saved-secrets dir) and
.pytest_cache/.
- tests/ adds 34 unittest cases (no extra dependency, runs with stdlib):
TOML upsert/escape/round-trip, runner partial-run safety, Instagram
account validation + per-kind dispatch + Apify Bearer auth, Apify
adapter field mapping, legacy CLI flag translation. Runs via
'python -m unittest discover -s tests'.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
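The translation idea in miniature; the --set keys used here (max_comments, include_replies, fetch_transcripts) are illustrative stand-ins for whatever the YouTube plugin actually names its settings:

    def translate(legacy_args) -> list[str]:
        # legacy_args is the argparse namespace from the original youtube_parser CLI
        argv = ["run", "--source", "youtube"]
        if legacy_args.query:
            argv += ["--query", legacy_args.query]
        if legacy_args.video:
            argv += ["--video", legacy_args.video]
        argv += ["--set", f"max_comments={legacy_args.max_comments}",
                 "--set", f"include_replies={str(legacy_args.include_replies).lower()}",
                 "--set", f"fetch_transcripts={str(not legacy_args.no_transcripts).lower()}"]
        return argv  # handed to content_parser.cli's entry point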
Four input kinds:
- subreddit (name or URL, with or without 'r/' prefix)
- query (full-text search across all of Reddit)
- post_url (specific thread for comment analysis)
- user (posts by a given Redditor — competitor tracking)
Settings cover the listing knobs (hot/top/new/rising/controversial), time_filter for top/controversial, max posts per input, comment collection (top-level only by default, mirroring the YouTube plugin), and an opt-in expand_more_comments flag for users who want the full tree at the cost of slower scrapes.
The adapter maps PRAW Submission/Comment objects into the unified Item schema (see the sketch below): score / upvote_ratio / num_comments / the NSFW, locked, and spoiler flags / external link domain go into media; awards and post_hint go into extra. Deleted authors render as "[deleted]" rather than None. Comments are flattened with parent_id linkage so the same Markdown renderer that handles YouTube replies works unchanged.
Secrets needed: REDDIT_CLIENT_ID + REDDIT_CLIENT_SECRET (free, created at reddit.com/prefs/apps as a "script" app). REDDIT_USER_AGENT is optional with a sensible default.
Adds 41 new tests (75 total) covering adapter field mapping, input normalization (subreddit/user prefixes, URL parsing), reject paths (invalid chars, listing URL in post_url field), comment depth + cap behavior, and PRAW listing dispatch via mocks. praw>=7.7 added to requirements.txt.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
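Roughly the mapping described above; the PRAW attribute names are the library's own, while the media key names on the Item side are placeholders for the real schema fields:

    def submission_media(sub) -> dict:
        """Map a praw.models.Submission onto the Item's media dict (key names illustrative)."""
        return {
            "score": sub.score,
            "upvote_ratio": sub.upvote_ratio,
            "comments_count": sub.num_comments,
            "nsfw": sub.over_18,
            "locked": sub.locked,
            "spoiler": sub.spoiler,
            "link_domain": None if sub.is_self else sub.domain,
        }

    def author_name(obj) -> str:
        return "[deleted]" if obj.author is None else str(obj.author)  # PRAW returns None for deleted accounts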
- _file_stem now passes source and item_id through _safe_filename, not just title. Defense in depth against an upstream API returning a malicious id like '../../etc/passwd' that would have escaped the output directory. Verified by tests that hit write_item_json / write_item_markdown with traversal attempts and assert the resulting path stays under out_dir.
- _is_reddit_post_url now matches the host exactly (== 'reddit.com' or endswith '.reddit.com', same for redd.it). The previous substring check let 'evilreddit.com' and 'reddit.com.evil.example' through. Tests added for the lookalike rejection plus a positive case for legitimate subdomains like old.reddit.com.
- build_reddit logs a WARNING when REDDIT_USER_AGENT is unset, before falling back to a generic default. Reddit's API rules ask for a username-bearing UA; the warning surfaces the misconfiguration that would otherwise just look like flaky rate limits.
- Reddit fetch errors now go through _redact_spec, which strips query strings and caps length to 80 chars. Prevents accidentally pasting a URL with ?token=... into the field and seeing it echoed back through exception messages and Streamlit logs.
- README.md adds a 'Sharing scraped results' section warning that comments are written to Markdown unescaped — fine for personal viewing, but raw output/ should not be republished without a sanitizer because of Markdown link injection vectors.
- 19 new tests (94 total): _safe_filename behavior, _file_stem path traversal, write_item_* containment, _is_reddit_post_url lookalike rejection + subdomain acceptance, _redact_spec behavior, and build_reddit's logging assertion via patch.dict on sys.modules.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
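The exact-match host check, roughly (later factored out as the shared _is_reddit_host helper; the real _is_reddit_post_url also inspects the path shape, omitted here):

    from urllib.parse import urlparse

    def _is_reddit_host(host: str) -> bool:
        host = (host or "").lower().rstrip(".")
        return (host in ("reddit.com", "redd.it")
                or host.endswith(".reddit.com")
                or host.endswith(".redd.it"))

    def _is_reddit_post_url(url: str) -> bool:
        parsed = urlparse(url)
        return parsed.scheme in ("http", "https") and _is_reddit_host(parsed.hostname or "")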
…ck, host symmetry
Should-fix items from the second review pass:
- _file_stem now appends a short sha256 prefix when item_id sanitizes to
the fallback ('item'), so two items whose ids both reduce to special
chars no longer clobber each other on disk.
- _redact_spec also strips URL fragments (#access_token=...) in addition
to query strings, since OAuth implicit-flow tokens travel there.
- build_reddit now treats whitespace-only REDDIT_USER_AGENT as missing
and falls back to the default with the WARNING log, instead of
silently passing whitespace through to PRAW.
- _normalize_subreddit and _normalize_user reject non-Reddit hosts when
given a URL, mirroring _is_reddit_post_url. Cosmetic — PRAW would
still hit api.reddit.com — but keeps validation symmetric.
Nice-to-haves while we're here:
- replace_more on expand_more=True is now hard-capped at 32 expansions
(constant _MAX_REPLACE_MORE) instead of unbounded. Unbounded calls
could pull thousands of comments and minutes of latency on big threads.
- 'rising' listing on a user (PRAW doesn't expose it) falls back to
'new' with an INFO log so the user sees why the result differs.
- _is_reddit_host extracted as a shared helper used by all three URL
validators.
8 new tests (102 total) cover stem collision avoidance, fragment
redaction, whitespace UA fallback, non-reddit host rejection in both
normalizers, replace_more cap, and the rising→new log.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
Three input kinds:
- query: groups.search → wall.get for each found community
- community: screen_name / club<id> / numeric / vk.com URL
- post_url: vk.com/wall<owner>_<post>
Settings cover the whole pipeline: max communities per query, max posts per wall (capped at VK's 100/call), fetch_comments toggle, max comments per post (paginated via wall.getComments offsets), and comment_depth top_level vs all (with thread_items_count=10 when 'all').
The adapter resolves author names via the profiles + groups arrays returned by extended=1 calls — no extra users.get / groups.getById roundtrips. Negative owner_ids correctly map to club<id>; positive ones to id<id>.
Security carry-overs from the previous reviews:
- VKClient sends access_token in the POST body, never query string.
- VK error_code 5/17/27/28 → AuthError; 6/9/29 → RateLimitError; rest → PluginError. UI surfaces these distinctly.
- _normalize_community and _extract_wall_id reject non-VK hosts (vk.com, vk.ru, m.vk.com only — substring match would let evilvk.com through).
- _normalize_community rejects VK reserved paths (feed, im, video, etc.) that would otherwise look like screen names but aren't communities.
- _redact_spec strips ?query and #fragment before logging.
47 new tests (149 total): adapter field mapping for posts/comments and user vs group label resolution, normalization (screen_name / club / URL / lookalike host / reserved path), wall ID extraction, _redact_spec, client error code mapping, token-not-in-URL invariant, and fetch dispatch for query/community/post including dedupe across mixed inputs.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
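The error-code split described above, schematically; the exception classes come from the core errors module, whose import path is assumed here:

    from content_parser.core.errors import AuthError, PluginError, RateLimitError  # assumed path

    AUTH_CODES = {5, 17, 27, 28}
    RATE_LIMIT_CODES = {6, 9, 29}

    def _raise_for_vk_error(payload: dict) -> None:
        err = payload.get("error")
        if not err:
            return
        code, msg = err.get("error_code"), err.get("error_msg", "")
        if code in AUTH_CODES:
            raise AuthError(f"VK auth error {code}: {msg}")
        if code in RATE_LIMIT_CODES:
            raise RateLimitError(f"VK rate limit {code}: {msg}")
        raise PluginError(f"VK API error {code}: {msg}")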
…apter
Should-fix items from the combined review:
- _fetch_comments now checks the cap *before* every append (top-level AND reply), so depth=all on a thread with hundreds of replies no longer overshoots max_comments by one. Also short-circuits pagination using the response's `count` field instead of doing one extra round-trip just to see an empty page.
- VKClient retries RateLimitError (codes 6/9/29) with exponential backoff (1s, 2s, 4s, ... up to max_rate_limit_retries=3 by default) before bubbling up. AuthError and other PluginErrors are not retried. _sleep is a static method so tests can patch it without timing flakes.
- VKClient now uses a single requests.Session for the whole client lifetime, so we don't pay the TLS handshake on every API call.
- post_to_item raises ValueError when owner_id or id is missing, instead of silently constructing item_id="0_0" which would collide across multiple malformed posts.
- _collect_for_spec post-path no longer duplicates the group/profile-cache lookup that the adapter already does via _label_for_id; just appends (post, None) and lets the adapter resolve. Extracted the shared response-merging logic into _extract_extended.
9 new tests (158 total): retry-then-succeed, give-up-after-max-retries, auth-not-retried, top-level cap exact, depth=all overflow control, single-page short-circuit on count, multi-page pagination continues when count > page, adapter ValueError on missing fields. The earlier ClientErrorMappingTest cases were updated to patch requests.Session (not requests.post) since the client now uses a session.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
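The retry loop described above, schematically; method names other than _sleep are placeholders, and the real client's signatures may differ:

    import time
    from content_parser.core.errors import RateLimitError  # assumed path

    class VKClient:
        max_rate_limit_retries = 3

        def _post(self, method: str, params: dict):
            raise NotImplementedError  # real client POSTs to api.vk.com with the token in the body

        @staticmethod
        def _sleep(seconds: float) -> None:       # static so tests can patch it without timing flakes
            time.sleep(seconds)

        def call(self, method: str, **params):
            delay = 1.0
            for attempt in range(self.max_rate_limit_retries + 1):
                try:
                    return self._post(method, params)  # raises RateLimitError on codes 6/9/29
                except RateLimitError:
                    if attempt == self.max_rate_limit_retries:
                        raise
                    self._sleep(delay)
                    delay *= 2                         # 1s, 2s, 4s, ...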
Two input kinds:
- channel: @username, plain username, or t.me URL (parses recent messages)
- post_url: t.me/<channel>/<msg_id> for a specific post + its comments
Reuses APIFY_API_TOKEN from the Instagram plugin so a Streamlit Cloud
user only configures the Apify secret once. The default actor is
apify/telegram-channel-scraper but actor_id is exposed as a setting
so it can be swapped (e.g. 73code/telegram-scraper) without code edits.
The adapter is field-shape-defensive because different Telegram
scrapers on Apify use different key names: _pick walks a list of
likely keys, _reactions_total accepts a list of {emoji, count} dicts,
a flat {emoji: count} mapping, or just an int. Comments embedded in
the message dict (replies_data, comments, discussion, thread.items)
all parse to the same Comment list.
Security carry-overs from the prior reviews:
- _is_tg_host does exact-match on t.me / telegram.me to reject
evilt.me and t.me.evil.example
- _normalize_channel rejects Telegram reserved paths (joinchat, proxy,
iv, etc.) that would otherwise look like usernames
- _extract_post_url rejects /c/<chatid>/ private-channel paths since
the public scrapers cannot read them
- _redact_spec strips ?query and #fragment before logging
- post-fetch comment count is capped to max_comments_per_post even
when the actor returns more
49 new tests (207 total): _pick fallback chain, _reactions_total over
all three reaction shapes, message_to_item with primary and alt field
names, zero-views preserved, inline-comment extraction, alternative
field-name fallbacks, host validation lookalike rejection, reserved
path rejection, /c/ private path rejection, dispatch one-actor-call vs
two for mixed inputs, actor_id override, dedupe across channel+post,
and comment cap enforcement.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
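The shape-defensive helpers described above, roughly (behavior inferred from this commit; edge cases in the real code may differ):

    def _pick(msg: dict, *keys, default=None):
        """Return the first present, non-None value among several likely key names."""
        for key in keys:
            if key in msg and msg[key] is not None:
                return msg[key]
        return default

    def _reactions_total(value) -> int | None:
        if value is None:
            return None
        if isinstance(value, int):
            return value                                                  # actor already summed it
        if isinstance(value, dict):
            return sum(v for v in value.values() if isinstance(v, int))  # flat {emoji: count}
        if isinstance(value, list):
            return sum(int(d.get("count", 0)) for d in value if isinstance(d, dict))  # [{emoji, count}, ...]
        return None

    # usage: views = _pick(msg, "views", "viewsCount", "view_count")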
…nt, reply tree
Should-fix items from the combined review:
- actor_id is validated against ^[A-Za-z0-9_-]+[/~][A-Za-z0-9_.-]+$.
Whitespace-only or unset falls back to the default actor cleanly;
garbage like 'noslash' or '/missing' raises PluginError up-front
instead of being sent to Apify and producing a confusing 404.
- fetch() does ONE pass through actor results: parses each message
to Item, dedupes by item_id in the same loop. The previous version
parsed each message twice (once for dedupe key, once for yield),
doubling the adapter cost on big result sets.
- _replies_count handles both shapes: 'replies: 42' (number) and
'replies: [...]' (list of comment dicts → use len). Previously
number-only responses left media.comments_count as None.
- _extract_comments now also looks at the bare 'replies' field for
comment lists (not just replies_data/comments/discussion/thread).
- Reply tree linkage: when a comment has reply_to_message_id (or
replyToMessageId / reply_to_msg_id) and the parent is in the same
fetched batch, we set parent_id accordingly so the Markdown writer
can render the thread structure. Out-of-batch references stay top-level.
- _is_private_channel_url helper catches t.me/c/<chat_id>/... before
_extract_post_url returns None, raising an explicit PluginError that
tells the user the URL is private and Apify scrapers can't read it.
- _to_int defensively coerces numeric values, refusing to silently
store a stray dict (e.g. {'count': 100}) in media when an actor
uses an unexpected schema. Applied to views/forwards counts.
- Cosmetic: media_obj computed once instead of msg.get('media') twice.
25 new tests (232 total): _to_int across all input shapes including
the dict-leak guard, _replies_count for int/list/alt-keys,
reply_to_message_id parent linkage with both inside and outside-batch
references, dict-views does-not-leak, actor_id validation across
five garbage forms plus default fallback for empty/whitespace,
ApifyError → PluginError wrapping for both channels and posts paths,
private /c/ URL explicit error.
Also fixes a regression introduced in the previous edit pass where
_channel_label lost its def line and became a continuation of
_replies_count's body — caught by the test suite immediately.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
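Two of the guards above, schematically (PluginError import path assumed; the regex and default actor are the ones named in this commit):

    import re
    from content_parser.core.errors import PluginError  # assumed path

    _ACTOR_ID_RE = re.compile(r"^[A-Za-z0-9_-]+[/~][A-Za-z0-9_.-]+$")
    _DEFAULT_ACTOR = "apify/telegram-channel-scraper"

    def _validated_actor_id(value: str | None) -> str:
        if value is None or not value.strip():
            return _DEFAULT_ACTOR                      # unset / whitespace-only falls back cleanly
        value = value.strip()
        if not _ACTOR_ID_RE.match(value):
            raise PluginError(f"actor_id does not look like <user>/<actor>: {value!r}")
        return value

    def _to_int(value):
        # Coerce numeric values defensively; refuse to store a stray dict like {'count': 100} in media
        if isinstance(value, bool):
            return None
        if isinstance(value, int):
            return value
        if isinstance(value, str) and value.isdigit():
            return int(value)
        return None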
Pull values from a column or range of any Google Sheets spreadsheet and
drop them into the active plugin's input tab. Same loader code will be
called from the cron runner in the next step, so it's designed for both
interactive and headless use.
Auth uses a Google Cloud service account: paste the JSON key into the
GOOGLE_SHEETS_CREDENTIALS secret, then share each target spreadsheet
with the service account's email (visible via .service_account_email()
helper for UX hints).
Loader API (content_parser/loaders/gsheets.py):
loader = GoogleSheetsLoader.from_secrets({"GOOGLE_SHEETS_CREDENTIALS": ...})
loaded = loader.load(sheet_id_or_url, tab="Communities", range_a1="A:A",
skip_header=False)
loaded.values # ['durov_says', 'telegram', ...] — flattened, deduped, trimmed
loaded.sheet_title / loaded.tab_title / loaded.count
Sidebar block "📥 Загрузить из Google Sheets" ("Load from Google Sheets")
exposes the same loader under any plugin: paste creds, paste the sheet URL,
pick tab + range, pick which input kind (channel / community / hashtag /
etc.) to populate, hit Загрузить (Load). Loaded values append to the
existing input field (preserving manual entries), so several sheets can be
merged before running.
Defensive behavior:
- credentials JSON is validated for type/client_email/private_key keys
before sending to gspread, with a clear AuthError if it's e.g. an
OAuth client JSON instead of a service account key.
- Sheet URL extraction tolerates the ID alone, the full /d/<id>/edit URL,
and trailing query params.
- A1 range validated against a permissive regex; an actual range error
from the API surfaces with the user's range echoed back.
- 403 from Google → AuthError with "share the sheet" hint. 404 →
PluginError with "check the URL/ID".
- Unknown tab name → PluginError listing the tab names that DO exist.
20 new tests (252 total): credentials validation across all four
malformed forms (non-JSON string, JSON-but-not-dict, missing field,
missing secret), sheet ID extraction (bare ID / full URL / URL with
query / garbage / empty), load() with single column / multi column /
deduplication / blank-skipping / skip_header / invalid range / unknown
tab / 403 / 404 / default-first-sheet.
requirements.txt: +gspread>=6.0, +google-auth>=2.20 (the latter was
already a transitive dep of google-api-python-client; pinning it
explicitly makes the loader self-contained for cron use later).
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
Should-fix items:
- _extract_sheet_id now validates the URL host strictly (must be docs.google.com). The previous regex.search would happily pull '/d/<id>/' out of any URL, including https://evil.com/.../d/<id>/... Not an SSRF (we don't fetch the user URL — the ID just becomes a parameter to the Google Sheets API), but the silent acceptance was misleading. Lookalike hosts and other Google subdomains (mail.google.com etc.) are now rejected explicitly.
- validate_credentials extracted as a static method that does the shape check WITHOUT building a gspread client. The save button now validates pasted JSON via this helper before persisting, so users see "JSON невалиден: …" ("JSON is invalid: …") immediately instead of saving garbage that fails on next load.
- The service-account 'type' field is now checked too: an OAuth client JSON (type=authorized_user) is rejected with a message that points the user to the right kind of credential.
- All UI buttons in this block translated to Russian (Сохранить / Удалить / ✏️ Заменить / ✕ Отмена, i.e. Save / Delete / Replace / Cancel) — was an English-Russian mix.
Nice-to-haves while we're here:
- After creds are saved, the field collapses to a one-line summary: "✓ Учётка сохранена: bot@project.iam.gserviceaccount.com" ("credentials saved") with a hint to share the spreadsheet with that email — addresses both the "where do I find this?" UX gap and the security concern of re-rendering the full RSA private key in plain text on every load. An ✏️ Заменить (Replace) button reveals the textarea again.
- A warning caption above the JSON field reminds the user that the JSON contains a private key.
- st.spinner around the load call so the UI shows progress feedback.
- An empty / whitespace 'tab' parameter falls back to the first sheet (matters for cron configs that may pass tab="").
- raw_rows dropped from LoadedRange — it was populated but never read, carrying unnecessary copies of full sheet data in memory.
8 new tests (260 total): non-Google host rejected (with an explicit docs.google.com hint in the error), lookalike host rejected, other-Google subdomain rejected (mail.google.com), OAuth client JSON rejected, validate_credentials does NOT call _build_client, empty/whitespace tab fallback to first sheet, raw_rows attribute removed from LoadedRange.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
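The strict extraction described above, roughly (PluginError import path and the bare-ID length heuristic are assumptions):

    import re
    from urllib.parse import urlparse
    from content_parser.core.errors import PluginError  # assumed path

    _BARE_ID_RE = re.compile(r"^[A-Za-z0-9_-]{20,}$")

    def _extract_sheet_id(value: str) -> str:
        value = value.strip()
        if not value.startswith(("http://", "https://")):
            if _BARE_ID_RE.match(value):
                return value
            raise PluginError(f"does not look like a Sheets ID or URL: {value!r}")
        parsed = urlparse(value)
        if (parsed.hostname or "").lower() != "docs.google.com":
            raise PluginError("Google Sheets URLs must point at docs.google.com")
        match = re.search(r"/d/([A-Za-z0-9_-]+)", parsed.path)
        if not match:
            raise PluginError("could not find /d/<id>/ in the URL")
        return match.group(1)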
Job describes one scheduled run: source plugin, inputs (inline list and/or
Google Sheets references), settings, optional cron schedule. Job-files
live in ~/.content_parser/jobs/<name>.yaml and are read with yaml.safe_load
to keep the door closed on !!python/object construction tricks.
Schema (jobs/schema.py):
- Job dataclass with validate(): rejects bad names (regex
^[A-Za-z0-9_-]{1,64}$), missing source, invalid cron expressions, jobs
without any inputs, malformed sheet_inputs, unknown notify_on_failure.
- SheetInput dataclass mirrors GoogleSheetsLoader.load() args.
- is_valid_cron loosely accepts standard 5-token expressions and @-aliases
(@daily, @Weekly, ...). It refuses garbage like 'rm -rf /' that contains
characters outside [\d*/,-A-Za-z].
- resolved_output_dir() returns output/scheduled/<name>/<timestamp>/ by
default, an absolute output_dir as-is, or a relative one resolved
against cwd. The timestamp suffix is always appended.
Store (jobs/store.py):
- list_jobs() / load_job() / save_job() / delete_job() / job_exists().
- Path resolution validates the candidate is inside JOBS_DIR via
Path.resolve() + relative_to() — defense in depth even though the
job-name regex already keeps slashes out.
- list_invalid() returns (name, error) pairs for files that fail to
parse, so the UI can surface broken jobs instead of silently dropping.
- save_job sets chmod 600 (best-effort).
Runner (jobs/runner.py):
- run_job(name) loads the YAML and runs run_job_obj(job).
- _resolve_inputs merges inline values with Sheets-loaded values per
input kind, then dedupes preserving insertion order, then drops empty
kinds.
- _collect_secrets pulls plugin secret_keys + GOOGLE_SHEETS_CREDENTIALS
if any sheet_inputs present + the same WEBSHARE_/PROXY_ optional set
the CLI/UI uses.
- On success: writes .last_run.txt marker. On failure: writes
last_error.txt with traceback unless notify_on_failure='none'. The
original exception is re-raised so cron sees a non-zero exit.
48 new tests (308 total): cron expression validation across standard
and alias forms (and rejection of cmd-injection-shaped garbage), Job
validation across every guard (bad name / no source / no inputs /
invalid schedule / malformed sheet ref / unknown notify), YAML
round-trip + safe_load enforcement (rejects !!python/object), name_hint
fallback when YAML omits 'name', range vs range_a1 alt key,
resolved_output_dir for default/relative/absolute, store CRUD with
path-traversal rejection, list_jobs sorting + skip-invalid behavior,
chmod 600 on save, runner input merge with dedupe, secret collection
(plugin / sheets-needed / optional proxy), success/failure marker
writing, notify_on_failure=none suppresses error file, empty resolved
inputs raise PluginError before plugin is touched.
requirements.txt: +pyyaml>=6.0.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
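What a job file might look like, expressed as the dict yaml.safe_dump would serialize; every key name here is illustrative, and the authoritative schema is jobs/schema.py:

    import yaml

    job = {
        "name": "weekly_durov",
        "source": "telegram",
        "schedule": "0 6 * * MON",            # optional; omit for manual-only jobs
        "inputs": {"channel": ["durov_says"]},
        "sheet_inputs": [
            {"sheet": "<sheet-id-or-url>", "tab": "Communities",
             "range_a1": "A:A", "kind": "channel"},
        ],
        "settings": {"max_comments_per_post": 50},
        "notify_on_failure": "none",          # suppresses last_error.txt on failure
    }
    text = yaml.safe_dump(job, sort_keys=False, allow_unicode=True)
    # would be written to ~/.content_parser/jobs/weekly_durov.yaml and read back with yaml.safe_load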
cron.py manages a marker-bounded block in the user's crontab without ever touching lines outside our markers:

    # >>> content_parser jobs >>>
    0 6 * * MON cd /repo && python -m content_parser.cli jobs run weekly # job:weekly
    # <<< content_parser jobs <<<

API:
- install_cron(jobs=None, project_root=None, python_executable=None, log_path=None) — collects every job with a schedule, regenerates the managed block. Idempotent: running twice with the same jobs yields the same crontab. Existing user lines outside the markers are preserved.
- remove_cron() — strips the block, returns True/False.
- read_block() — best-effort parse of currently-installed entries (schedule, job_name, command).
Safety:
- shlex.quote on every path/argument that goes into the cron command, so even a hypothetical bad job name (which the schema regex already rejects) couldn't inject extra shell metacharacters.
- Friendly errors for missing crontab binary and 'no crontab' state.
CLI subcommand `jobs`:
- jobs list → tabulated overview of all saved jobs + invalid files
- jobs show <name> → dump a job's canonical YAML
- jobs run <name> → invoke run_job() with stdout logging and progress
- jobs install-cron → regenerate the managed block
- jobs remove-cron → strip the managed block
- jobs cron-status → show what's currently in the block
18 new tests (326 total): _strip_block leaves outside lines untouched and handles block-at-start, _build_block produces marker-wrapped lines with # job:<name> footer, build_command_for_job shell-quotes paths with spaces and uses safe paths as-is, install_cron idempotency across runs, jobs without schedule are skipped, lines outside markers preserved through reinstall, remove_cron only writes when block exists, read_block parses entries back, _existing_crontab returns "" for the "no crontab for user" case but raises on real errors and on missing binary.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
New '🕐 Расписание' (Schedule) section appears at the bottom of every plugin's page. Lets the user:
1. List existing jobs with collapsed details: source, schedule (or a "ручной запуск" / manual-run badge), description, inline inputs and Sheet refs summary.
2. Act on each job: ▶️ Запустить (Run; calls run_job_obj with a live log), ✏️ Изменить (Edit; raw YAML editor with Save/Cancel), 🗑️ Удалить (Delete).
3. ➕ Создать job из текущего состояния (Create a job from the current state): captures the current input tabs + plugin settings into a new YAML file. Bare-minimum form: name, optional cron, optional description; sheet_inputs are added by editing the YAML afterward (since they need URL/tab/range fields).
4. 📅 Cron section, automatically grayed out on hosts without a `crontab` binary (Streamlit Cloud), where it shows a copy-paste GitHub Actions workflow as the alternative path. On hosts with crontab: install / remove buttons + a summary of the currently-installed entries.
The UI gracefully surfaces invalid YAML files via list_invalid(), so a user who hand-edited a file and broke it sees the parse error instead of having the job silently disappear.
is_cron_available() helper added to jobs/cron.py: it runs a one-shot `crontab -l` and catches FileNotFoundError. The UI calls it once per render to decide whether to show the install/remove buttons or the GH Actions template.
The run button label is now "▶️ Запустить (разово)" ("Run once") to disambiguate it from the per-job ▶️ buttons in the Schedule panel.
…friendly CLI
Should-fix items:
- The inputs YAML parser now refuses non-list values per kind. The previous comprehension iterated strings character-by-character, so the typo 'community: durov_says' (no brackets) silently produced ['d','u','r',...]. The fix raises a clear PluginError before the typo can corrupt a run (see the sketch below).
- Job.validate() now rejects '..' anywhere in output_dir parts. Absolute paths still go through (the user explicitly opts in), but the path-traversal cases '../../etc' and 'custom/../escape' are caught at validation.
- build_command_for_job rejects newlines and carriage returns in any path fragment (project_root, python_executable, log_path) and in job.name. shlex.quote happily preserves a literal \n inside its single-quoted output, which would split a crontab entry across two lines and corrupt the file. The schema's job-name regex already covers job.name, but the defense is added there too for future-proofing.
- cli jobs run wraps run_job in try/except for AuthError, PluginError and KeyError (unknown source from get_plugin), printing a friendly stderr message and returning exit code 1 instead of dumping a Python traceback.
- run_job_obj now computes resolved_output_dir() ONCE up-front. Earlier, a Sheets-load failure or empty resolved inputs would call job.resolved_output_dir() twice — once for the eventual run, again to pick a place for last_error.txt — producing two timestamped directories that differ by milliseconds. Now both markers land in the same dir.
12 new tests (338 total): output_dir rejected with .. at start / middle, absolute and normal-relative output_dirs accepted, string / int / dict values in inputs raise on YAML load (with the "must be a list" hint), an empty input value treated as an empty list, newline rejected in project_root / python_executable / log_path, carriage return rejected. The empty-resolved-inputs test in test_jobs_runner already verified the single-out_dir behavior end-to-end (passes with the refactor).
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
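The list-only guard from the first bullet, roughly (helper name and PluginError import path assumed):

    from content_parser.core.errors import PluginError  # assumed path

    def _validate_inputs(inputs: dict | None) -> dict:
        cleaned = {}
        for kind, values in (inputs or {}).items():
            if values in (None, ""):
                cleaned[kind] = []                       # empty value is treated as an empty list
            elif isinstance(values, list):
                cleaned[kind] = [str(v) for v in values]
            else:
                # 'community: durov_says' (no brackets) would otherwise be iterated char-by-char
                raise PluginError(f"inputs.{kind} must be a list, got {type(values).__name__}")
        return cleaned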
New content_parser/transcription/ module wires yt-dlp + OpenAI Whisper into the existing Item.transcript field. When the user enables 'transcribe_videos' in a plugin's settings, each Item with a video URL is downloaded as audio (MP3 64 kbps, well under the 25 MB Whisper API limit), shipped to api.openai.com/v1/audio/transcriptions, and the verbose_json segments are mapped onto the existing Transcript schema so the Markdown writer renders them the same way as YouTube subtitles.
Module layout:
- downloader.py — yt-dlp wrapper with the FFmpegExtractAudio postprocessor and a 24 MB filesize cap. get_duration_seconds() probes without downloading for budget gating.
- whisper_api.py — minimal Bearer-auth HTTP client (just `requests`, no `openai` package). Distinguishes 401 (bad key), 429 (rate limit), and other 4xx/5xx with the API's error message (see the sketch below).
- cache.py — ~/.content_parser/transcription_cache/<source>_<id>.json, so re-running a job doesn't re-pay for previously transcribed items.
- runner.py — maybe_transcribe(item, settings, secrets, only_if_missing=) is the single entry point plugins call. Order: cache check → duration cap → download → API → cache write.
Plugin integration:
- Instagram, VK, Telegram add `transcribe_videos` (bool, default off) and `max_audio_seconds_per_video` (default 600) FieldSpecs and call maybe_transcribe inline in fetch().
- YouTube treats Whisper as a fallback: only_if_missing=True means it runs only when youtube-transcript-api couldn't return segments (subs disabled, blocked, etc.). Avoids wasting API spend on videos that already have free subtitles.
UI:
- Sidebar shows an inline 'Параметры Whisper' (Whisper settings) expander when transcribe_videos is checked, with an OPENAI_API_KEY input + save/clear buttons + a caption about the cost and the ffmpeg requirement.
- OPENAI_API_KEY is in the optional shared-secrets list, so a saved value is picked up across plugins and by the cron runner.
Security carry-overs:
- Token in the Authorization: Bearer header, never the URL.
- _video_url_for prefers the canonical post URL (e.g. instagram.com/reel/AAA/) over CDN URLs in media.video_url, since CDN tokens often expire while yt-dlp can re-resolve from the post URL fresh.
- Cache filenames go through the _safe regex so a malicious upstream id like '../../etc' can't escape the cache dir.
- Hard cap on audio duration before download blocks surprise costs.
24 new tests (362 total): cache CRUD with path-traversal sanitization; whisper_api Bearer header / verbose_json format / language passthrough / 401 / 429 / other-error message extraction / valid response parsing; maybe_transcribe disabled-by-setting / no-key-sets-error / cache-hit-skips-network / full-pipeline-downloads-and-caches / duration-cap-blocks / download-failure-recorded / whisper-failure-recorded / only_if_missing-skips-when-present / only_if_missing-runs-when-empty / no-video-url-silent / prefers-canonical-url-over-cdn.
requirements.txt: +yt-dlp>=2024.0. ffmpeg required at runtime (documented in plugin help text).
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
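The Whisper call itself is just a multipart POST with a Bearer header; a trimmed sketch (the real whisper_api.py maps 401/429/5xx to typed errors rather than calling raise_for_status):

    import requests

    def transcribe_audio(path: str, api_key: str, language: str | None = None) -> dict:
        data = {"model": "whisper-1", "response_format": "verbose_json"}
        if language:
            data["language"] = language
        with open(path, "rb") as fh:
            resp = requests.post(
                "https://api.openai.com/v1/audio/transcriptions",
                headers={"Authorization": f"Bearer {api_key}"},   # token never goes in the URL
                files={"file": fh},
                data=data,
                timeout=600,
            )
        resp.raise_for_status()
        return resp.json()   # 'segments' are then mapped onto the existing Transcript schema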
…ion pin
Should-fix items:
- runner.maybe_transcribe now refuses URLs that aren't public HTTP(S): loopback hostnames (localhost / 0.0.0.0), IPv4/IPv6 literals in private RFC1918 ranges, link-local (169.254.0.0/16, incl. AWS metadata), reserved and loopback addresses. yt-dlp would otherwise happily fetch from internal networks if any third-party API (Apify/VK/Telegram actor) ever returned such a URL — chain-of-trust SSRF. Bare DNS names still pass since resolution happens later in yt-dlp; this layer only catches literals (see the sketch below).
- runner.maybe_transcribe blocks transcription when get_duration_seconds() returns None. Without a known length the per-video Whisper bill is unbounded; refusing is the cheap-and-safe default. Earlier code fell through this branch and would download anyway.
- whisper_api.transcribe_audio retries 429 (rate limit) and 5xx (server error) up to max_retries=2 with exponential backoff (2s, 4s). 401 and other 4xx surface immediately. _sleep is a module-level helper so tests patch it without slowing the suite — TranscribeAudioTest's test_429_rate_limit was updated to use max_retries=0 for the no-retry semantic.
Nice-to-haves:
- yt-dlp pinned to >=2024.0,<2027.0 to bound the supply-chain blast radius if a future major version ever ships a malicious extractor.
- The UI caption under Параметры Whisper now mentions that the saved key persists across checkbox toggles — only 🗑️ removes it.
17 new tests (379 total): _is_public_url across normal URLs, http variant, non-http schemes, localhost / 0.0.0.0 / 127.0.0.1 / IPv6 ::1, RFC1918 (10/172.16/192.168), link-local 169.254 (AWS metadata), IPv6 fc00::/7, empty/invalid input, DNS names pass through; runner blocks on private URL before any download; runner blocks on unknown duration; Whisper retry on 429-then-success / 500+503-then-success / exhausted retries; 401 and 400 do NOT retry (single call only). The existing test_429_rate_limit was adjusted for the new retry semantics.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
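The literal-IP guard, roughly; the stdlib ipaddress module does the classification, and bare DNS names pass through as described:

    import ipaddress
    from urllib.parse import urlparse

    def _is_public_url(url: str) -> bool:
        try:
            parsed = urlparse(url or "")
        except ValueError:
            return False
        host = parsed.hostname
        if parsed.scheme not in ("http", "https") or not host:
            return False
        if host.lower() == "localhost":
            return False
        try:
            ip = ipaddress.ip_address(host)
        except ValueError:
            return True                     # bare DNS name; resolution happens later in yt-dlp
        return not (ip.is_private or ip.is_loopback or ip.is_link_local
                    or ip.is_reserved or ip.is_multicast or ip.is_unspecified)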
Three cross-cutting issues from the full-project review, in one batch:
1. GitHub Actions tests workflow (.github/workflows/tests.yml). Runs on every push and PR-to-main against Python 3.11 and 3.12, installs requirements.txt, runs `python -m unittest discover -s tests -v`, and smoke-checks `cli list-sources`. No more silent regressions between manual reviews.
2. _redact_spec was reimplemented in three plugins (Reddit, VK, Telegram), each with the security-relevant job of stripping ?query and #fragment from URLs before they hit logs or exception messages. When we added fragment-stripping to Reddit, the others were missed for a release. Extracted to content_parser/core/redact.py as redact_spec() (see the sketch below). All three plugins now import the single canonical implementation; tests import from core.redact too (aliased to _redact_spec locally to keep diffs small).
3. ApifyClient lived in plugins/instagram/apify_client.py and Telegram imported from there — a runtime cross-plugin dependency that would silently break if Instagram were renamed or removed. Moved to content_parser/clients/apify.py (a new top-level package for shared HTTP clients). Both Instagram and Telegram now import from the shared module; Instagram's old file is deleted. Tests adjusted to patch the new module path (content_parser.clients.apify.requests.post).
All 379 tests still pass after the moves.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
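The canonical helper, roughly (length cap per the earlier Reddit commit):

    from urllib.parse import urlsplit, urlunsplit

    def redact_spec(spec: str, max_len: int = 80) -> str:
        """Strip ?query and #fragment from URL-looking specs and cap length before logging."""
        spec = (spec or "").strip()
        if "://" in spec:
            parts = urlsplit(spec)
            spec = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        if len(spec) > max_len:
            spec = spec[: max_len - 1] + "…"
        return spec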
New 'instagram_graph' plugin alongside the existing public 'instagram'
(Apify) plugin. Different tool for different jobs:
- 'instagram' — public posts from any account (Apify, $$$)
- 'instagram_graph' — your own posts + insights (Meta Graph API, free)
What it gives that Apify can't:
- Insights — reach, impressions, plays, saved, shares, total_interactions
on your own Reels and posts
- Full comment threads with replies and like counts
- No per-item Apify cost
- Stable, Meta-supported endpoint
Files:
- plugins/instagram_graph/client.py — GraphClient over graph.facebook.com
with retry-on-429/5xx exponential backoff (2s, 4s), pagination via
paging.next URL walking, embedded-token replacement so a 'next' URL
can't smuggle a different token through, error-code mapping
(190/102/etc → AuthError; 10/200/803 → AuthError "permissions";
4/17/32/613 → RateLimitError).
- plugins/instagram_graph/adapter.py — media_to_item maps a Graph
media object (IMAGE/VIDEO/REEL/CAROUSEL_ALBUM) to core.Item with
insights flattened into media dict; flatten_comments folds inline
replies (replies.data) into the flat Comment list with parent_id.
- plugins/instagram_graph/plugin.py — InstagramGraphPlugin with two
inputs: 'account' (Business Account ID, 15-20 digits regex-validated)
and 'post_id' (numeric media ID). Settings: max_posts_per_account,
fetch_comments / fetch_replies / fetch_insights toggles,
max_comments_per_post, plus the standard transcribe_videos /
max_audio_seconds_per_video pair. Whisper integration via the same
maybe_transcribe call as other video plugins.
Plumbing:
- registry.py registers the new plugin alongside the existing five.
- jobs/runner.py adds INSTAGRAM_ACCESS_TOKEN to the optional secrets
list, so cron jobs pick it up automatically.
- ui/app.py shared-secrets list extended too.
Auth requirements (documented in plugin.py docstring):
Convert IG account to Business/Creator → connect to a FB Page →
create Meta Developer App → generate long-lived token via Graph API
Explorer with scopes instagram_basic, instagram_manage_comments,
pages_show_list, business_management → store as INSTAGRAM_ACCESS_TOKEN.
Insights are best-effort: if the /insights call returns a permissions
error (common on archived posts or older media), we swallow it and
continue with the rest of the run instead of dying.
42 new tests (421 total): client (token always overrides embedded ones,
401/code-10/code-4/5xx error mapping, retry-on-429-then-success, retry
exhaustion, pagination across pages, max_items early-exit, embedded-
token override on next URL); adapter (insights envelope flattening
across dict/list shapes, media_to_item field mapping for REEL +
non-REEL, owner_username override, missing-id raises, falls back
gracefully when owner_username not passed, comment_to_core for top
and reply, flatten_comments two-level expansion); plugin (resolve
validates account-id length and post-id format, dedupe across inputs,
fetch dispatch for account-path / post-path / mixed-with-dedupe,
fetch_insights=False skips /insights endpoint entirely, insights
failure does NOT abort the run).
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
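The pagination-with-token-override detail, schematically; the Graph API version string and the helper shape are assumptions, and error handling plus retries are omitted here:

    import requests
    from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

    GRAPH = "https://graph.facebook.com/v19.0"

    def _force_token(url: str, token: str) -> str:
        # paging.next embeds its own access_token; always overwrite it with ours
        parts = urlsplit(url)
        query = [(k, v) for k, v in parse_qsl(parts.query) if k != "access_token"]
        query.append(("access_token", token))
        return urlunsplit(parts._replace(query=urlencode(query)))

    def get_paginated(path: str, token: str, params: dict, max_items: int) -> list[dict]:
        items: list[dict] = []
        payload = requests.get(f"{GRAPH}/{path}",
                               params={**params, "access_token": token}, timeout=30).json()
        while True:
            items.extend(payload.get("data", []))
            next_url = payload.get("paging", {}).get("next")
            if not next_url or len(items) >= max_items:
                return items[:max_items]
            payload = requests.get(_force_token(next_url, token), timeout=30).json()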
Should-fix items:
- GraphClient now scrubs the access token from any RequestException
message before raising. The `requests` library sometimes embeds the
full URL — including ?access_token=… — in connection-error messages,
which would otherwise propagate to last_error.txt / Streamlit logs /
CLI stderr. The exception is re-raised with `from None` so the chained
__cause__ doesn't keep the unredacted original around either.
- is_reel boolean now has explicit parens —
(media_type == "REEL") or (media_type == "VIDEO" and product_type=="REELS")
— instead of relying on Python's `and > or` precedence, which is easy
to misread.
- media_to_item accepts insights as a keyword argument instead of having
callers mutate `media["insights"]`. The plugin now passes the freshly-
fetched insights data through; media dict stays read-only.
- Stale comment about a non-existent _get_url helper replaced with
accurate description of what get_paginated actually does.
4 new tests (425 total): RequestException with the token in its message
gets [REDACTED] in the propagated PluginError; chained __cause__ is None
so the secret doesn't leak through traceback.format_exc; 5xx-then-5xx-
then-success retries with exponential backoff (mirrors the existing 429
test); insights metric selection differs for REEL vs IMAGE media types
(REEL gets plays+total_interactions, IMAGE gets impressions+reach+saved).
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
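The scrub-and-reraise pattern, roughly (helper names illustrative; PluginError import path assumed):

    import requests
    from content_parser.core.errors import PluginError  # assumed path

    def _scrubbed(exc: Exception, token: str) -> str:
        msg = str(exc)
        return msg.replace(token, "[REDACTED]") if token else msg

    def safe_get(url: str, token: str) -> requests.Response:
        try:
            return requests.get(url, params={"access_token": token}, timeout=30)
        except requests.RequestException as exc:
            # 'from None' drops __cause__ so the unredacted message can't resurface
            # via traceback.format_exc() in last_error.txt or Streamlit logs
            raise PluginError("Graph API request failed: " + _scrubbed(exc, token)) from None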
…c cache, Cloud detection, status file
Six findings from the project-wide audit landed in one batch:
1. CSV formula injection guard (core/output.py). Excel/Sheets/LibreOffice
execute any CSV cell starting with =, +, -, @, \t, \r as a formula
(=cmd|'/c calc'!A1 is the canonical RCE proof-of-concept). User-
controlled fields like title and author can come straight from
Apify/Reddit/YouTube comments, which means any of our scrapes could
ship a CSV that runs shell commands when a non-technical viewer opens
it in Excel. _csv_safe prepends a single quote to neutralize the
formula while keeping the value visible. Applied to every string
column going into summary.csv.
2. Token redaction in last_error.txt (jobs/runner.py). The previous
implementation wrote `traceback.format_exc()` raw — and tracebacks
carry the chained exception's message, which can include API URLs
with ?access_token=… in the query (we redact at the source for
Instagram Graph but not in every other plugin's exception path).
_record_failure now scrubs every secret value it knows about (8+
chars only, to skip noise) before the file lands on disk. Both
call sites pass `secrets=secrets` from collect_secrets.
3. YouTube replies cap honored (plugins/youtube/comments.py). When
include_replies=True, fetch_comments used to call _fetch_all_replies
without bound — a single popular top-level comment with 500 replies
would return 1+500 items only for `comments[:max_comments]` to throw
most of them away. The fix threads `remaining = max_comments -
len(comments)` through to _fetch_all_replies, which now stops both
inside the inline loop and at page boundaries. Also requests page
sizes proportional to remaining quota.
4. Atomic transcription cache (transcription/cache.py). put() now
writes to <name>.json.tmp and renames over the final path. POSIX
guarantees rename atomicity, so a crash during the write leaves
either the old value or the new value, never a half-written JSON
that get() catches as ValueError and silently treats as cache miss.
5. Streamlit Cloud detection in secrets layer (core/secrets.py).
.streamlit/secrets.toml is managed by the Cloud dashboard and
read-only at the filesystem level. Detect via STREAMLIT_RUNTIME
env, STREAMLIT_SHARING, or HOSTNAME=streamlit-* and skip the file
write entirely — local config.json (the other write target) still
persists so the value works for the current container; users
mirror it via Settings → Secrets for next deployment.
6. Unified .last_status.json (jobs/runner.py). Both _record_success
and _record_failure now write a single canonical status file that
monitoring / UI can stat once for "is this job healthy?". Schema:
{job, source, status, finished_at, items, error}. Atomic write via
.tmp+replace as well.
17 new tests (442 total): _csv_safe across all five injection-prone
prefixes (= + - @ \t \r) and the safe-string / None / non-string
passthrough cases; an end-to-end summary.csv test that injects a
malicious title and verifies the round-tripped DictReader sees the
quoted form. record_failure-redaction (passes a secret value, expects
[REDACTED] in last_error.txt) and the new status-file shape. Cache
atomicity (no .tmp left after success, second put() replaces first).
Streamlit Cloud detection (STREAMLIT_RUNTIME=cloud → no file written;
empty env → file written normally).
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
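The guard from item 1 above, roughly:

    _FORMULA_PREFIXES = ("=", "+", "-", "@", "\t", "\r")

    def _csv_safe(value):
        """Neutralize spreadsheet formula injection; non-strings pass through untouched."""
        if isinstance(value, str) and value.startswith(_FORMULA_PREFIXES):
            return "'" + value
        return value

    # applied to every string column before csv.DictWriter emits summary.csv:
    # writer.writerow({k: _csv_safe(v) for k, v in row.items()})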
Summary
- python -m content_parser.cli {list-sources, run --source ...}
- youtube_parser/ keeps working via re-export shims; the legacy CLI is now a thin translation layer over the new one
What's where
Setup for Streamlit Cloud
Add to Settings → Secrets:
Security highlights (covered by tests)
- _safe_filename applied to source and item_id (defense in depth against malicious upstream IDs); _file_stem appends a short hash when sanitization collapses the id, so collisions can't clobber files
- Authorization: Bearer … header (not query string) so it doesn't leak into nginx logs
- \ and " are escaped when writing secrets.toml, so secrets containing these characters round-trip cleanly
- Exact host matching for Reddit URLs (reddit.com / .redd.it) — evilreddit.com rejected
- _redact_spec strips both query and fragment from URLs before they enter exception messages or logs
- API key inputs use type="password"; ~/.content_parser/config.json and .streamlit/secrets.toml are written chmod 600
- output/ is in .gitignore (contains scraped comments, possibly PII)
Roadmap (not in this PR)
- cli jobs install-cron
Test plan
- pip install -r requirements.txt succeeds
- python -m content_parser.cli list-sources → youtube, instagram, reddit
- python -m unittest discover -s tests → 102 passed
- python -m content_parser.cli run --source youtube --video https://youtu.be/...
- python -m content_parser.cli run --source instagram --account nasa --set max_posts_per_input=5
- python -m content_parser.cli run --source reddit --subreddit python --set listing=top --set time_filter=week
- python -m youtube_parser.main --video URL --max-comments 10 still works
- streamlit run app.py shows source selector with all 3 plugins; tabs render dynamically
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta